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Abstract 

The stochastic block model (SBM) is a mixture model used for the clustering of 
nodes in networks. It has now been employed for more than a decade to analyze 
very different types of networks in many scientific fields such as Biology and social 
sciences. Because of conditional dependency, there is no analytical expression for the 
posterior distribution over the latent variables, given the data and model parameters. 
Therefore, approximation strategies, based on variational techniques or sampling, 
have been proposed for clustering. Moreover, two SBM model selection criteria exist 
for the estimation of the number K of clusters in networks but, again, both of them 
rely on some approximations. In this paper, we show how an analytical expression 
can be derived for the integrated complete data log likelihood. We then propose 
an inference algorithm to maximize this exact quantity. This strategy enables the 
clustering of nodes as well as the estimation of the number clusters to be performed 
at the same time and no model selection criterion has to be computed for various 
values of K . The algorithm we propose has a better computational cost than existing 
inference techniques for SBM and can be employed to analyze large networks with 
ten thousand nodes. Using toy and true data sets, we compare our work with other 
approaches. 

Keywords: Random graphs, stochastic block models, integrated classification 
likelihood. 

1. Introduction 

There is a long history of research on networks which goes back to the earlier work 
of Moreno pp. Because they are simple data structures yet capable of representing 
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complex systems, they are used in many scientific fields [21 E] • Originally considered 
in social sciences jlj to characterize relationships between actors (5], E], networks 
are also used to describe neural networks [7], powergrids [8], and the Internet P, 
[TO] . Other examples of real networks can be found in Biology with the use of 
regulatory networks to describe the regulation of genes by transcriptional factors 
[TT] or metabolic networks to represent pathways of biochemical reactions [12J. As 
the number of networks used in practice has been increasing, a lot of attention has 
been paid on developing graph clustering algorithms to extract knowledge from their 
topology. Existing methods usually aim at uncovering very specific patterns in the 
data, namely communities or disassortative mixing. For an exhaustive review, we 
refer to [13J. 

Most graph clustering algorithms look for communities, where two nodes of the 
same community are more likely to be connected than nodes of different commu- 
nities. These techniques [TU [15] often maximize the modularity score proposed by 
Girvan and Newman [TB] for clustering. However, recent work of Bickel and Chen 
[TT] showed that they were asymptotically biased and tended to lead to the discovery 
of an incorrect community structure, even for large graphs. Alternative strategies, 
see for instance |18) . are generally related to the probabilistic model of Handcock, 
Raftery and Tantrum [J_H] which generalizes the work of Hoff, Raftery and Handcock 
[2D] . Nodes are first mapped into a a latent space and then clustered depending on 
their latent positions. Community structure algorithms are commonly used for af- 
filiation network analysis. As mentioned in [21], other graph clustering algorithms 
aim at uncovering dissasortative mixing in networks where, contrary to community 
structure, nodes mostly connect to nodes of different clusters. They are particularly 
suitable for the analysis of bipartite or quasi bipartite networks [22J . 

In this paper, we consider the stochastic block model (SBM) proposed by Nowicki 
and Snijders [23] which is a probabilistic generalization [H [5] of the work of White, 
Boorman and Breiger [23]. As pointed out by Daudin, Picard and Robin [25], SBM 
can be seen as a mixture model for graphs. It assumes that nodes are spread into 
K clusters and uses a K x K matrix II to describe the connection probabilities 
between pairs of nodes. No assumption is made on II such that very different 
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structures can be taken into account. In particular, as shown in [2E], contrary to 
the methods mentioned previously, SBM can be used to retrieve both communities 
and disassortative mixing in networks. 

Many extensions have been developed to overcome some limits of the standard 
SBM. For example, Mariadassou, Robin and Vacher [27] introduced recently a prob- 
abilistic framework to deal with valued edges, allowing covariates to be taken into 
account. While the first model they proposed explains the value of an edge, between 
a pair of nodes, through their clusters only, the second and third approaches do ac- 
count for covariates through Poisson regression models. This framework is relevant 
in practice because extra information on the edges is sometimes available, such as 
phylogenetic distances in host-parasite networks or amounts of energy transported 
between nodes in powergrids. 

Another drawback of SBM is that it assumes that each node belongs to a sin- 
gle cluster while many objects in real world applications belong to several groups 
or communities [2B]. To tackle this issue Airoldi et al. [2jJ] proposed the mixed 
membership stochastic block model (MMSBM) [301 El]. A latent variable 7Tj, drawn 
from a Dirichlet distribution, is associated to each node i of a network. Given a 
pair of nodes, two binary latent vectors Z$_^- and Z^- are then considered. 
The vector Z^- is assumed to be sampled from a multinomial distribution with pa- 
rameters (1, 7Ti) and describes the cluster membership of i in its relation towards j. 
By symmetry, Zi«_j is drawn from multinomial distribution with parameters (1, 7Tj) 
and characterizes the cluster membership of j in its relation towards i. Thus, in 
MMSBM, since each node can have different latent vectors through its relations 
towards other nodes, it can belong to several clusters. The connection probability 
between i and j is finally given by Pij = ZT -BZ^-. The overlapping stochastic 
block model (OSBM) was proposed by Latouche, Birmele and Ambroise [2B] as an 
alternative probabilistic model for networks allowing overlapping clusters. Contrary 
to MMSBM, edges are influenced by the fact that some nodes belong to multiple 
clusters. Thus, each node i is characterized by a binary latent vector Zj sampled 
from a product of Bernoulli distributions. An edge between nodes i and j is then 
drawn from a Bernoulli distribution with parameter Pij= g ( az . z .)■ The function g(-) 
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is the logistic function while az 4 ,z- is a real variable describing the interactions be- 
tween the nodes, depending on the different clusters they are associated with. It 
is given by az u z t = ZjWZj + Z^U + V T Zj + W*. Finally, we mention the work 
of Karrer and Newman [32] who proposed an interesting extension of SBM to deal 
with node degree heterogeneity inside clusters. The model deals with valued edges 
and includes another set of parameters describing vertices attractiveness. Using the 
right constraints the model is identifiable (up to permutations of clusters) and the 
atractivity parameters can be directly related to vertices degree. This work was 
extended to oriented networks in |33j and finally tools for model selection between 
different models are derived in [34J. 

In this paper, we wont consider the extensions for SBM we mentionned, we will 
rather focus on the standard SBM. Our goal here is not to propose new extensions, 
allowing a SBM like model to be applicable on specific types of networks, or to 
introduce new latent structures. Conversely, considering the standard SBM, which 
has been widely used in practice for network analysis, for more than a decade, we 
aim at developping a new optimization procedure, improving over existing inference 
strategies. In SBM, the posterior distribution over the latent variables, given the 
parameters and the observed data, cannot be factorized due to conditional depen- 
dency. Therefore, optimization techniques such as the expectation maximization 
(EM) algorithm cannot be used directly for clustering. To tackle this issue, Daudin, 
Picard and Robin [25J proposed an approximation method based on a variational 
EM algorithm. Note that an online version of this algorithm exists [33] • A Bayesian 
framework was also considered by Nowicki and Snijders [23J where conjugate priors 
for the model parameters were introduced. Again, because the posterior distribu- 
tion over the model parameters, given the data, is not tractable, approximation 
techniques were employed for inference. Thus, Nowicki and Snijders [23J used a 
Gibbs sampling procedure while Latouche, Birmele and Ambroise [36J relied on a 
variational Bayes EM algorithm. To our knowledge, only two model selection cri- 
teria, the integrated classification likelihood (ICL) and the integrated likelihood 
variational Bayes (ILvb) have been developed for SBM in order to estimate the 
number K of clusters in networks. Standard criteria such as the Akaike information 
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criterion (AIC) or bayesian information criterion (BIC) cannot be used because they 
rely on the SBM observed data log likelihood which is not tractable in practice (see 
for instance [26]). ICL was originally developed by Biernacki, Celeux and Govaert 
[37] for Gaussian mixture models and then adapted by Daudin, Picard and Robin 
[21)] to SBM. It is based on Laplace and Stirling approximations of the integrated 
complete data log likelihood. As shown in [38], it tends to miss some important 
structures present in the data for small data sample, because of the asymptotic ap- 
proximations. To tackle this drawback, Latouche, Birmele and Ambroise proposed 
in [36] the ILvb criterion which relies on a variational Bayes approximation of the 
integrated observed data log likelihood. 

In this paper, we show how an analytical expression of the integrated complete 
data log likelihood can be obtained in a Bayesian framework and that no asymp- 
totic approximation is required. We call the corresponding criterion ICL ex where 
ex stands for exact. We then propose a greedy inference algorithm which maxi- 
mizes this exact quantity. The strategy has three advantages compared to existing 
approaches. First, it maximizes an analytical criterion directly derived from SBM, 
while variational techniques for instance rely on lower bounds for approximation. 
Thus, the lower bound of the variational EM algorithm proposed by Daudin, Pi- 
card and Robin [23] approximates the observed data log likelihood, while Latouche, 
Birmele and Ambroise |36j introduced a lower bound to estimate the integrated ob- 
served data log likelihood. Second, ICL ex only depends on all the latent binary 
vectors Z i; stored in the matrix Z, and the number K of clusters, not on the model 
parameters which are marginalized out. Therefore, the optimization task focus on 
(K, Z) and is purely combinatorial. When using the Gibbs algorithm [23], the suc- 
cessive samples for Z and model parameters are highly correlated. As a consequence, 
nodes tend to be stuck in clusters, after a few iterations. Similar remarks could be 
made for the variational EM and variational Bayes EM algorithms. In our case, 
because the parameters are marginalized out (collapsed) the method is expected to 
explore more easily the latent space of Z. This property is at the core of collapsing 
methods (for more details, we refer to [39]). Finally, our strategy enables the clus- 
tering of nodes as well as the estimation of the number of clusters to be performed 
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at the same time and no model selection criterion has to be computed for various 
values of K. Starting from a complex model with K = K up clusters, (K up being 
an upper bound for K), the proposed algorithm swaps labels until ICL ex reaches a 
local maximum. During the process, clusters may disappear, i.e. their cardinality 
reaches zero. Such an approach leads to a simple and time attractive algorithm with 
complexity of 0(L + NK^ P ), with L the total number of edges in the network and 
N the number of vertices. 

As we shall see through a series of experiments, the greedy algorithm takes benefit 
of computing the exact ICL and improves over existing methods, both in terms of 
clustering and model selection. It can also deal with large networks with tens of 
thousands of vertices. 

2. The stochastic block model 

We consider a binary network with N nodes represented by its adjacency matrix 
X such that = 1 if there is an edge from node i to node j, otherwise. In this 
paper, we focus on directed networks, i.e. relations are oriented. Therefore X is 
not symmetric. Moreover, we do not consider any self loop, that is an edge from 
a node to itself. We emphasize that all the optimization equations derived in this 
work can easily be adapted to deal with undirected networks or to take into account 
self loops. 

2.1. Model and notations 

The stochastic block model (SBM) introduced by Nowicki and Snijders [23] as- 
sumes that the nodes are spread into K clusters with connectivity patterns described 
by a K x K matrix II. The cluster of each node is given by its binary membership 
vector Zj sampled from a multinomial distribution : 



such that = 1 if i belongs to cluster k and zero otherwise. Contrary to the work 
of Latouche, Birmele and Ambroise [2E], each node belongs to a single cluster, that 
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is X)fe=i = 1, Vi. Given the vectors Z, and Zj, an edge between node i and j is 
then drawn from a Bernoulli distribution with probability Um : 

Xij\z ik Zji — i ~ £>(n H ). 

This leads to a simple yet flexible generative model for networks. First, all the 
vectors Zj are sampled independently. We denote Z the binary N x K matrix 
storring the Z,s as raw vectors : 

N N K 

p(Z\a) = J] M(Z t ; l,a) = nil C 1 ) 

i=l i=l fe=l 

Then, given the latent structure Z, all the edges in X are drawn independently : 

TV 

p(x|z,n)=JJp(x y |z i ,z i ,n) 

AT K 

= [][[fi(I i3 ;n H ) z ^ (2) 

vfij k,l 
N K 



nnO 1 ^ 1 -"" _ 

2.2. Integrated classification likelihood criteria 

In this paper, we consider the integrated complete data log likelihood logp(X, Z\K) 
in order to focus on the inference of Z and K from the observed data X, all the 
SBM parameters (a, It) being integrated out. We first recall existing approxima- 



tions and then show in Section |2.2.2| how an analytical expression of this quantity 
can be derived. 

2.2.1. Asymptotic ICL criterion 

When considering a factorized prior distribution p{a,TH\K) = p(ot\K)p(Yl\K) 
over the model parameters, as in [37J, the integrated complete data log likelihood 
easily decomposes into two terms: 

logp(X, Z\K) = log (^J p(X, Z, n, a\K)dadn 

= log^y p(K\Z,U,K)p{U\K)dU J p{Z\a,K)p{a\K)do^j ( 3 ) 
= logp(X|Z,fO + logp(Z|fO. 



However, for an arbitrary choice of the priors p(a\K) and p(TH\K), the marginal 
distributions p(X|Z, K) as well as p(Z\K) are usually not tractable and ^ does 
not have any analytical form. To tackle this issue, Daudin, Picard and Robin [25] 
proposed an asymptotic approximation of logp(X, Z\K), so called integrated classi- 
fication likelihood criterion (ICL). Note that ICL was originally proposed by Bier- 
nacki, Celeux and Govaert [37] for Gaussian mixture models. It was then adapted 
by Biernacki, Celeux and Govaert [38] to mixtures of multivariate multinomial dis- 
tributions and to the SBM model by Daudin, Picard and Robin |25J . In the case we 
consider of a directed graph without self-loop, ICL is given by: 

ICL(Z,K) « \ogp(X,Z\K) 

= maxlogp(X,Z|a,n,iT) - -K 2 \og (N{N - 1)) - — -^log(iV). 
r*.n 2 2 

For an extensive description of the use of Laplace and Stirling approximations to 
derive the ICL criterion, we refer to [37]. Since it approximates the integrated 
complete data log likelihood, ICL is known to be particularly suitable when the focus 
is on the clustering task and not on the estimation of the data density. However, 
as shown in [381 E7] , it tends to miss some important structures present in the data 
because of the (asymptotic) approximations. 

We emphasize that ICL is only used in the literature as a model selection cri- 
terion. In practice, a clustering method such as an EM like algorithm for instance 
is employed to obtained some estimates Z of Z, for various values of the number 
K of classes. ICL is then computed for every pair (Z,K) and the pair (Z*,K*) is 
chosen such that the criterion is maximized. Thus, ICL is optimized only through 
the results (Z,K) produced by the clustering algorithm. Conversely, after having 
given an analytical expression ICL ex of the integrated complete data log likelihood 
in the next section, we will show in Section [3] how to optimize directly ICL ex with 
respect to Z and K. 

2.2.2. Exact ICL criterion 

We rely on the same Bayesian framework as in [23] and [26J. Thus, we consider 
non informative conjugate priors for the model parameters ex and II. Since ex, 
describing the cluster proportions, parametrizes a multinomial distribution ([I]), we 
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rely on a Dirichlet prior distribution: 

p(ct) = Dir (a; n° = (nj, . . . , n^)) . 

A common choice consists in fixing the hyperparameters to 1/2, i.e. n k = 1/2, Vfc. 
Such a distribution corresponds to a non informative Jeffreys prior which is known 
to be proper [ID]. A uniform distribution can also be obtained by setting the hyper- 
parameters to 1. 

Moreover, since the presence or absence of an edge between nodes is sampled 
from a Bernoulli distribution, we consider independent Beta prior distributions to 
model the connectivity matrix II: 

K 
k,l 

Again, if no prior information is available, all hyperparameters rj kl and ( kl can be 
set to 1/2 or 1 to obtain a Jeffreys or uniform distribution. 

With these choices of conjugate prior distributions over the model parameters, 
the marginal distributions p(X|Z, K) as well as p(Z\K) in ^ have analytical forms, 



and so has the integrated complete data log likelihood, as proved in |Appendix A 
We call ICL ex the corresponding criterion, where ex stands for exact. It is given 
by: 

ICL ex (Z,K) =logp(X,Z\K) 

— / r(^ + OTi7«)r(c«)\ , ,„ /r(EL«2)nf =1 rw' 



k,l 

where the components n k are: 



4r 8 \nv k i+Cki)v(v o kl nc° k i)J g 



r(Ef =1 ^)nf =1 r^" 



n. 



N 

n k = n° k + J2 Z ik,^ke {1,...,K}, 

i=l 

and can be seen as pseudo counters of the number of nodes in each class. Moreover, 
the parameters (r] k[ , ( k i) are given by: 

N 

m = v°m + ZikZflXij^ik, e {l, . . . , k}\ 
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and 

N 

Cm = Cm + X; Z lk Z 3l (l - Xij),V{k, /)€{!,..., K} 2 . 

They represent pseudo counters of the number of edges and non-edges connecting 
nodes of class k to nodes of class I, respectively. 

Note that the ICL ex criterion is related to the variational Bayes approximation 
of the integrated observed data log likelihood logp(X.\K) proposed by Latouche, 
Birmele and Ambroise |36j. The key difference is that the parameters (rik, i]ki, Cm) in 
ICL ex depend on the hard assignment Z of nodes to classes and not on approximated 
posterior probabilities r. Moreover, ICL ex does not involve any entropy term as in 

3. Greedy optimization 

Since the model parameters have been marginalized out, the ICL ex criterion only 
involves the cluster indicator matrix Z whose dimensionality depends on the number 
K of clusters. Thus, this integrated likelihood is only a function of a partition V, 
i.e. an assignment of the vertices to clusters. Looking directly for a global maximum 
of ICL ex is not feasible because it involves testing every possible partition of the 
vertices with various values of K. However, this is a combinatorial problem for which 
heuristics exist to obtain local maxima. In this paper, we rely on greedy heuristics 
which have been shown to scale well with sample sizes [14] . These approaches have 
already been used for graph clustering using ad-hoc criteria such as modularity 
[HJ Hi] and are reminiscent of the well known iterated conditional modes algorithm 
of Besag |42j used for maximum a posteriori estimation in Markov random fields. 

The algorithm (see Algorithm[TJ starts with a SBM model with K = K up clusters, 
K up being an upper bound for the number of clusters. K up is assumed to be given 
as an input along with aiVx K up matrix Z. In practice, K up is set to a large value 
using user knowledge on the problem at hand, while Z can be initialized with the 
methods described in the next section. The algorithm then cycles randomly through 
all the vertices of the network. At each step, a single node % is considered while all 
the membership vectors Zj for j ^ % are hold fixed. If % is currently in cluster g, 
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the method looks for every possible label swapping, i.e. removing i from cluster g 
and assigning it to a cluster h ^ g, and computes the corresponding change A g ^h 



in the ICL ex criterion. Note that A 5 _>h takes two forms (see Appendix B) whether 
cluster g is empty after removing i or not. If no label swapping induces an increase 
of the criterion, the vector Zj is not modified. Otherwise, the label swapping with 
the maximal increase is applied and Z« is changed accordingly. During the process, 
clusters may disappear, i.e. their cardinality reaches zero. Each time one of these 
moves is accepted, the model is updated and the corresponding column is removed 
from the cluster indicator matrix Z. Finally, the algorithm stops if a complete 
pass over the vertices did not lead to any increase of the ICL ex criterion. Thus, the 
algorithm, automatically infers the number of clusters while clustering the vertices of 
the network. Starting with an over-segmented initial solution our approach simplifies 
the model until a local maximum is reached. 

3.1. Complexity 

In order to set up such an algorithm, it is sufficient to know how to compute the 
changes in the ICL ex criterion induced by the possible swap movements (from cluster 
g to cluster h) for a given node i, the others being kept fixed. Such changes can be 



computed efficiently (see Appendix B for details) and the complexity of finding the 
best swap movement for a node is in average 0(l+K 2 ), where lis the average number 
of edges per node. Such complexity can be achieved, since good approximations of 
the logarithm of the gamma function are available with constant running time. The 
greedy algorithm has therefore a total complexity of 0(N{1 + K 2 p ) + L), since a 
swap movement cost is 0(1 + K 2 ); the initialization of the edges counters (rjkhCkl) 
cost is L (the total number of edges in the graph) and several complete passes 
over the set of nodes will be performed (typically less than 10). Eventually, this 
can be simplified in 0(NK 2 p + L) and compared to the complexity of 0(LK^ p ) 
achieved using a variational algorithm and a model selection criterion as in [251 [36] . 
Indeed, contrary to our approach which estimates the number of clusters in a single 
run, while clustering the nodes, these approaches are run multiple times for various 
values of K and K* is chosen such that the corresponding model selection criterion 
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Algorithm 1: Greedy ICL 
Set K = K up ; stop = ; 

Initialize the N x K up matrix Z ; Compute ?7,C,n ; 
while stop ^ 1 do 

U = {l,...,iV};stop = l; 
while V not empty do 

Select a node i randomly in V ; Remove i from V ; 
If i is in cluster g, compute all terms A^^V/j ^ g ; 
if at least one A g ^h is positive then 
stop = ; 

Find h such that A g _^ is maximum ; 
Swap labels of i : Z ig = and = 1 ; 
if g is empty then 

Remove column g in Z ; Set K — K — 1 ; 
end 

Update rows and columns (g, h) of the matrices f] and £ ; 
Update the components g and h of vector n; 
end 
end 
end 

Result: (Z,K) 

is maximized. Since each run costs 0(LK 2 ), the overall complexity is 0{LK\ v ). 

3.2. Initialization and restarts 

Several solutions are possible for initializing the algorithm, a simple choice con- 
sisting in sampling random partitions while a more relevant though expensive start- 
ing point can be obtained with the k-means algorithm. One possible trade-off in 
terms of computational burden is to use only few iterations of k-means. We used 
the latter method in all the experiments that we carried out. Moreover, since our 
method is only guaranteed to reach a local optima, a common strategy is to run 
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the optimization algorithm with multiple initializations and to keep the best one 
according to the ICL ex criterion. 

3.3. Hierarchical clustering 

Eventually, in a final step, it is possible to check that merge movements between 
clusters do not induce any increase of the objective function. This can be done with a 



greedy hierarchical algorithm which costs 0(K 3 ) (see details in Appendix C). Since 
the labels swap algorithm usually greatly reduces the number of clusters (K << 
K up ), the computational cost of this last step is low. 

Such a scheme leads to a fast algorithm: sparse networks take about 15 seconds 
for iV = 500 nodes, and about five minutes for N = 5000 with a naive Matlab 
implementation. 

4. Experiments on synthetic data 

To assess the greedy optimization method, a simulation study was performed and 
the proposed solution was compared with available implementations of algorithms 
for SBM inference: 

• vbmod, [43], a variational-based approach dedicated to the search of commu- 
nity structures, implemented in Matlab and C. The random graph model they 
considered can be seen as a constrained SBM where all terms on the diagonal 
of the connectivity matrix II are set to a unique parameter A and off-diagonal 
terms to another parameter e, 

• mixer, [25], another variational approach but one which can deal with all 
types of SBM models (not only communities structures) implemented in R 
and C, 

• colsbm, [44], a collapsed Gibbs sampler for SBM in C. The last version of the 
code is used in this experimental section. It involves an additional move type 
compared to the algorithm described in the associated publication. This move 
was found to greatly enhance the results. 
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Our goal here is to evaluate the ability of the different solutions to recover a 
simulated clustering without knowing the number of clusters. Only a reasonable 
upper bound K up on K will be provided to the algorithms when needed. We recall 
that the variational methods optimize a lower bound for various values of K and 
select K* such that a model selection criterion is maximized: ICL for mixer and 
ILvB for vbmod. Conversely, the collapsed Gibbs sampler automatically provides 
an estimate of K, since the posterior of K is made available. 

The performances are assessed in terms of normalized mutual information (see 
for instance [45] ) between the estimated cluster membership matrix Z e and the 
simulated one Z s . The mutual information J(Z e , Z s ) between two partitions is to 
this end defined by: 

/(^z*)=f>,iog(^y (4) 

with 

TV , N , N 



Pkl ~ jv Z ik Z h Pk ~ ]y ^2 Z u» p i ~ n Z * 

i,j i=l i=l 

The measure I(Z e , Z s ) describes how much is learnt about the true partition if the 
estimated one is known, and vice versa. The mutual information is not an ideal 
similarity measure when the two partitions have a different number of clusters and 
it is therefore preferable to use a normalized version of the mutual information such 
as: 

with H(Z) = —J2k=iPk^°&(Pk) and pk = %ih- The performances are eval- 

uated on simulated clustering problems of varying complexity and with different 
settings, in order to give insights about the influence of the number K of clusters, 
of the number of vertices N and of the type of connectivity matrix II. 

4-1. Setting 1: small scale community structures 

The first setting is a classical community simulation with iV = 100 vertices and 
K — 5 clusters. The cluster proportions are set to a = (1/5,1/5,1/5,1/5,1/5) 
and the connectivity matrix takes a diagonal form with off-diagonal elements equal 
to 0.01: Tiki = 0.01, V/c 7^ I and diagonal elements given by 11^ = /3,Vfc. /3 is a 
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Figure 1: Mean of mutual information between estimated and true cluster membership matrices 
using 20 simulated graphs for each value of /3 in {0.45, 0.43, . . . , 0.03, 0.01}, and with N = 100, K = 
5, e = 0.01 for the different algorithms greedy icl, vbmod, colsbm and mixer. 

complexity tuning parameter which ranges from 0.45 to 0.01. When reaches 0.01, 
the model is not identifiable (the connectivity matrix is constant) and the true cluster 
memberships cannot be recovered. The set of simulated problems is therefore of 
varying complexity: from problems with a clear structure when /3 = 0.45 to problems 
without any structure when = 0.01. The experiments are performed twenty times 
for each value of /3 and the average of the normalized mutual information over these 
twenty simulated graphs is depicted in Figure [l] (left) for all the algorithms. In 
order to produce results as comparable as possible, the parameters of the different 
algorithms were set as follows: vbmod, mixer and greedy icl were all started ten 
times and for each method the best run was selected according to the corresponding 
model selection criterion. The variational methods were run with K between 2 and 
20 and the best clustering kept as a final result. For greedy icl, the parameters of 
the prior 77°, (° and n° were set to 1 and K up fixed to twenty. Finally the collapsed 
Gibbs sampler was run for 250 000 iterations (more than twice the default value). 

The results illustrated in Figure [T] show that greedy icl outperforms the other 
methods for complex problems, i.e. low values of /3. The simulated clustering is 
recovered until (3 reaches 0.25. Above this value the different algorithms perform 
identically, but beyond this limit the results of greedy icl are a little bit better. 
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During the transition greedy icl gets slightly better results than the other algo- 
rithms, it is followed by colsbm and vbmod which give close results and mixer 
that deviates earlier from the planted clustering. 

4-2. Setting 2: small scale community structures with a hub cluster 

The second setting aims at exploring the performances of the methods when the 
latent structure does not correspond only to communities. To this end, graphs were 
generated using the stochastic block model with affiliation probability matrix II of 
the form as in |36|: 



h 


(3 ... 


... p\ 


p 


P e 


... e 


p 


e P 


... e 


& 


e ... 


13 £ 




e ... 


... P) 



The clusters correspond therefore to communities, except one cluster of hubs 
which connects with probability to all other clusters. Graphs with N = 100 
vertices, K = 5 clusters and a = (1/5, 1/5, 1/5, 1/5, 1/5) were generated using this 
connection pattern. The parameter e was set to 0.01 and (5 ranged as previously 
from 0.45 to 0.01. Eventually, the other simulation parameters did not change. The 
results are shown in Figure [l] (right). 

As expected, the vbmod algorithm, which looks only for communities, is strongly 
affected by this change of setting and systematically misses the hub cluster. For the 
remaining methods, the best results are achieved by greedy icl which still uncovers 
the planted clustering when /3 > 0.25, whereas mixer starts to drop at (3 equals 0.4. 
The collapsed Gibbs sampler achieves also good results in this setting, very close to 
those of greedy icl and out-performs mixer. 

4-3. Setting 3: medium scale community structures 

The third setting is a classical community simulation but with more nodes and 
clusters, in order to study the effect of these two parameters. Thus, the number of 
vertices was set to iV = 500 and the number of clusters to K = 10. The cluster 
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Figure 2: Mean of mutual information between estimated and true cluster membership matrices 
using 20 simulated graphs for each value of j3 in {0.45, 0.43, . . . , 0.03, 0.01}, and with TV = 500, K = 
10, e = 0.01 for the different algorithms greedy icl, vbmod, colsbm and mixer. 

proportions were denned as a = (1/10, . . . , 1/10) and all the other parameters kept 
the same value as previously. For this third experiment, the results presented in 
Figure [2] are identical for greedy icl, colsbm and mixer. The vbmod algorithm 
seems to be more affected by the dimensionality of the problem, and did not recover 
exactly the true clusters when j3 is under 0.2. The results obtained by the different 
algorithms in this setting are better than those obtained previously. This can easily 
be explained by the increase in the number of nodes per cluster. The transitions 
between high and low values of the normalized mutual information were also sharper 
than in the previous experiments, for the same reasons. 

4-4- Setting 4'- large scale problem with complex structure 

The final setting involves larger graphs with N = 10 000 vertices. The planted 
structure is also not a purely community pattern. Some interactions between clusters 
are activated randomly using a Bernoulli distribution as described by the following 
generative model: 

\ZU+(l-Z)e, lik^l 
Tl k i={ (6) 
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with Z ~ #(0.1), U ~ ZY(0.45) and e = 0.01. The size of the problem and the 
complex nature of the underlying structure, let only two algorithms able to deal with 
these graphs namely greedy icl and colsbm, since mixer cannot handle such large 
graphs and vbmod deals only with community structure. Both were used to cluster 
20 simulated graphs generated using this scheme. The greedy algorithm was started 
using K up = 100 and the same setting as previously for the prior distributions. 
The results presented as boxplots in Figure [3] give a clear advantage to greedy 
icl over the collapsed sampler. Thus, greedy icl achieves an average normalized 
mutual information of 0.88 whereas colsbm reaches only 0.67. In fact, the greedy 
solution ended with around 80 clusters for all the simulations whereas the Gibbs 
sampler gives more than 240 clusters in average and therefore produces highly over 
segmented partitions of the graphs. 



Colsbm - 




Greedy ICL - 

0.70 0.75 0.80 0.85 0.90 

normalized mutual information 

Figure 3: Mean of the mutual information between estimated and true cluster membership matrices 
using 20 simulated graphs with N = 10000 and K = 50. 

To summarize the results we obtained in all the experiments we carried out, it 
appears that greedy icl compares favourably with the other existing solutions for 
SBM, in all the settings. The results obtained in complex setting, i.e. large graphs 
and a complex underlying structure (Setting 4) are particularly encouraging since 
greedy icl clearly outperforms the collapsed Gibbs sampler. 

5. Real dataset: communities of blogs 

The proposed algorithm was finally tested on a real network where vertices cor- 
respond to blogs and edges to known hyperlinks between the blogs. All the blogs 
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considered are related to a common topic, i.e. illustrations and comics. 

The network was built using a community extraction procedure [16] which starts 
from known seeds and expands them to find a dense core of nodes surrounding them. 
It is made of 1360 blogs linked by 33 805 edges. The data set is expected to present 
specific patterns, namely communities, where two blogs of the same community 
are more likely to be connected that nodes of different communities. To test this 
hypothesis we used the greedy ICL algorithm and did a qualitative comparison of 
the results with those obtained with the community discovery method of Blondel et 
al. [H]. 

Starting with K up = 100 clusters, greedy ICL found K — 37 clusters. The 
corresponding clusters are illustrated in Figure [4] which is an image of the adjacency 
matrix with rows/columns sorted by cluster number. Thus, it appears that the 
clusters found correspond in their vast majority to small sub-communities. These 
sub- communities all correspond to known groups. For instance a group of blogs 
of illustrators for Disney was found. Other examples include clusters of blogs of 
students who went to the same illustration school such as the ECMA school of 
Angouleme or the "Gobelins Ecole de l'image". However, some clusters have more 
complex connectivity structures and are made of hubs which highly connect to blogs 
of different clusters. They correspond to blogs of famous writers such as Boulet. 

To give a qualitative idea of the interest of the found clustering, we also give 
the results obtained by the community discovery algorithm of Blondel et al. [H] 
in Figure [5j With this approach only 8 clusters are found, corresponding all to 
sub-communities. Clusters of hubs could not be recovered. The major difference 
between the number of clusters estimated by the two methods may be explained by 
two facts. Firstly, modularity is known to be prone to a resolution limit problem 
|17] which prevents such a solution to extract small scale structures. This explains 
why the small sub-community extracted by greedy icl are not recovered using the 
modularity. For the time being, the behaviour of the ICL ex criterion with respect 
to the resolution limit problem is not clear and will deserve further investigations. 
However, we notice that on this dataset finer structures than those obtained using 
modularity are recovered. Secondly, the difference in the way the two criteria use 



19 





















:■ ■ i 
































































Ill : 






















































































































































































































■ 



































































































































































Figure 4: Adjacency matrix of the network of blogs, the rows/columns are sorted by cluster number 
with clusters found by the greedy ICL algorithm. The cluster boundaries are depicted with white 
lines. 

degree correction or not [32] can also explain the disparity in the number of clusters. 
While modularity is a degree-corrected criterion which downscales the weights of the 
edges between highly connected vertices, the ICL ex criterion for the basic stochastic 
block model used here is not. Using a degree correction or not is a modelling choice 
which deserves to be validated and investigated; however, it seems that even without 
degree correction the results obtained by greedy icl are meaningful, the hub clusters 
being interesting per se. 

6. Conclusion 

In this paper, we showed how an analytical expression of the integrated complete 
data log likelihood could be derived using conjugate priors for the model parame- 
ters, and that no asymptotic approximations were required. We then proposed a 
greedy optimization algorithm to maximize this exact quantity. Starting from an 
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Figure 5: Adjacency matrix of the network of blogs, the rows/columns are sorted by cluster number 
with clusters found by modularity optimization. The clusters boundaries are depicted with white 
lines. 

over segmented partition, the approach simplifies the model, while clustering the 
vertices, until a local maximum is reached. This greedy algorithm has a competitive 
complexity and may handle networks with tens of thousands of vertices. We illus- 
trated on simulated data that the method improves over existing graph clustering 
algorithms, both in terms of model selection and clustering of the vertices. A quali- 
tative comparison between methods was also carried out on an original network we 
built from blogs related to illustration, comics, and animations. 

We emphasize that the methodology we considered can be adapted to other 
mixture models. In particular, we will investigate the case of the degree corrected 
stochastic block model which have been shown to give promising results on real data. 
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APPENDIX 



Appendix A. Marginal distributions 

Proposition Appendix A.l. The marginal distribution p(Z\K) is given by: 

where the components of the vector n are n& = n k + Y2iLi Zik, for all k in {1, ... , K} 
and the function C(-) is such that C(x) = ji^pjr-^^ for all x in W K . 

1 (Z^fc=l x k) 

Proof. 

N K 
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Proposition Appendix A. 2. The marginal distribution p(X.\Z, K) is given by: 



/ v i7 TT B(r)kii Cu) 
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where r} kl = rf kl + £ Z^Z^X^ and ( kl = Qi + E^y Z ikZji(l - X tj ) for all (k, I) 
{1, . . . , -ft'} 2 . T/ie function B(a, b) is such that B(a, b) = T ^!!+b) f or a ^ ( a ' ^) ^ n 

Proof. 

I N K Z Z \ K 
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Proposition Appendix A. 3. Using factorized and conjugate prior distributions 
over the model parameters, the integrated complete data log likelihood is given by: 

«™-f:*(Sfe§)+*(S)- 

where 

• Vkl = Vh + YnjLj ZikZjiXij for all (k,l) in {1, . . . , K} 2 

• Cm = C°h + E^j ZikZji{l - Xij) for all (k, I) in {1, . . . , K} 2 

• the components of the vector n are = n k ) + Ylf=i Zik, f or all k in {1, ... , K} 

• the function B(a, b) is such that B(a, b) = I ^+l)" > for all (a, b) in M 2 

• the function C(-) is such that C(x) = I^i- 1 "^'! for all x in M. K 

w v ' r(Eifc=i x k) 

Proof. Considering factorized prior distributions, the integrated complete data log 
likelihood decomposes into two terms: 

logp(X, Z\K) = log (^j p(X, Z, n, a\K)dadnj 

= log^y p(X|Z,n, K)p(U\K)dU J p(Z\a,K)p{a\K)daj ( A - 2 ) 
= logp(X|Z,A) + logp(Z|A). 



Using Propositions Appendix A.l and Appendix A. 2 in (A.2) gives: 
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Appendix B. Change in ICL induced by a swap movement i : g — >• h 

At each step of the greedy ICL algorithm, a single node i is considered. If % is 
currently in cluster g, the method tests every possible label swapping g — )■ h, that 
is removing i from cluster g and assigning it to a cluster h ^ g. The corresponding 
changes in the ICL ex criterion are denoted A^. In order to derive the calculation 
of each term A^^, for all h ^ g, we consider two cluster indicator matrices Z as 
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well as Z test . Z describes the current partition of the vertices in the network, while 
jtest re p resen t s the partition after applying the swap g — >■ h: 



ytest 



Z ik = 0,Vfc ^ g,h 



while 



Z\f = 0, Z ig = 1 



Z£* = 1, Z ih = 



Thus 



A a ^ h = ICL ex (Z test ,K test ) - ICL ex (Z,K). 

Note that A g ^ h takes two forms whether cluster g is empty after removing i or not. 
In the later scenario, the model dimensionality changes (i^* es * = K — 1) and this 
must be taken into account to evaluate the possible increase induced by the swap 
movement. 

Appendix B.l. Case 1 : £V Zj g st > 0. Cluster g not empty after removing i 

A - ft " log l^>rJ + ir log l J 
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with the changes in edges counter r\ k \ induced by the label swap: 



N N N 

7..Y 



hi = ^{k=h} ZjiXjj + i{i=h} z^x^ - t{k= g } z iX x., 
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Moreover, p^} is defined in the following: 

pjjj = {^{k=h} ~ l{fc= 9 }) fa ~ n °i ~ Z u) + ( 1 {«=M - 1 {i=g}) K ~n° k - Z ik ) - 5 



These update quantities can be computed in 0(li) with l { the degree of i (total 
number of edges from and to i). Therefore the complexity of finding the best swap 
movement for a node is Oil + K 2 ), I for computing the 8$ and K 2 to compute the 
^■swap with all the possible h labels and keep the best one. 

Appendix B.2. Case 2 : Yli Z ig St = 0, cluster g disappear 

In this case the dimensionality of n° changes and we will denote by n°* = 
(n°, . . . , n°) the corresponding vector of size K — 1. 

'C(n°)C(n test y 



= log 



C(n) C(n *) 
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the complexity in this case is the same as previously i.e. Oil + K 2 ). 
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Appendix C. Change in ICL induced by a merge movement 
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Moreover, p^J is defined in the following: 



Pkl = l{fc=h}(C 5 « - Cgl) + ^{l=h}((kg ~ Ckg) + l{fc=handi=h}(C»a ~ Cgg)- ( C - 2 ) 
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